In this notebook I'll fit Support Vector Machine (SVM) classifiers to data using scikit-learn, experiment with different kernels to see how they produce nonlinear decision surfaces, and use the fitted models to predict labels and measure the SVM's performance.
#Loading the appropriate packages
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import plotly.graph_objs as go
#from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
#init_notebook_mode(connected=True)
#turn off the scientific notation for floating point numbers.
np.set_printoptions(suppress=True)
The data is from a CSV file.
This dataset is the breast cancer Wisconsin (diagnostic) dataset, which contains 30 features computed from images of a fine needle aspirate (FNA) of a breast mass for each of 569 patients, with each example labeled as a benign or malignant mass.
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
df = pd.read_csv('data_svms_and_kernels.csv')
df
Now I'll extract the data from the dataframe into NumPy arrays, using LabelEncoder from scikit-learn to transform the labels into $\{-1,+1\}$:
X = df.drop('Label', axis=1).to_numpy()
y_text = df['Label'].to_numpy()
y = (2 * LabelEncoder().fit_transform(y_text)) - 1
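As a quick sanity check of that $\{-1,+1\}$ mapping, here is a small standalone sketch with made-up labels (not the notebook's data):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns integer codes in sorted label order: 'B' -> 0, 'M' -> 1
labels = np.array(['B', 'M', 'B', 'M', 'M'])
codes = LabelEncoder().fit_transform(labels)  # [0, 1, 0, 1, 1]
y_pm = 2 * codes - 1                          # maps {0, 1} to {-1, +1}
print(y_pm)                                   # [-1  1 -1  1  1]
```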
Let's check X, y_text and y:
X.shape
y_text
y
Let's make a scatter plot of the data:
points_colorscale = [
[0.0, 'rgb(239, 85, 59)'],
[1.0, 'rgb(99, 110, 250)'],
]
points = go.Scatter(
x=df['Feature 1'],
y=df['Feature 2'],
mode='markers',
marker=dict(color=y,
colorscale=points_colorscale)
)
layout = go.Layout(
xaxis=dict(range=[-1.05, 1.05]),
yaxis=dict(range=[-1.05, 1.05])
)
fig = go.Figure(data=[points], layout=layout)
fig.show()
It's time to split data into training, validation and test sets. Let's use 60% for training, 20% for validation and 20% for test data.
(X_train, X_vt, y_train, y_vt) = train_test_split(X, y, test_size=0.4, random_state=0)
(X_validation, X_test, y_validation, y_test) = train_test_split(X_vt, y_vt, test_size=0.5, random_state=0)
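A quick check on toy data (assumed shapes, not the real dataset) that the two-stage split indeed yields a 60/20/20 breakdown:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(100).reshape(50, 2)  # 50 examples with 2 features
y_toy = np.arange(50)

# first split off 40% for validation + test, then halve that 40%
X_tr, X_vt, y_tr, y_vt = train_test_split(X_toy, y_toy, test_size=0.4, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_vt, y_vt, test_size=0.5, random_state=0)
print(len(X_tr), len(X_val), len(X_te))  # 30 10 10
```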
I'll use the SVC class from scikit-learn. For now I won't use a nonlinear kernel, so I set the kernel argument of SVC to 'linear'.
svm = SVC(kernel='linear')
# fit svm to X_train and y_train
svm.fit(X_train, y_train)
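For a linear kernel the fitted model is just a hyperplane, so decision_function(x) reduces to $w \cdot x + b$, with $w$ in coef_ and $b$ in intercept_. A small sketch on synthetic data (not the notebook's training set) verifying this:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 2))
y_demo = np.where(X_demo[:, 0] + X_demo[:, 1] > 0, 1, -1)

clf = SVC(kernel='linear').fit(X_demo, y_demo)
w, b = clf.coef_.ravel(), clf.intercept_[0]
manual = X_demo @ w + b  # w·x + b computed by hand
assert np.allclose(manual, clf.decision_function(X_demo))
```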
Code a function to plot decision surface:
def svm_show(svm):
decision_colorscale = [
[0.0, 'rgb(239, 85, 59)'],
[0.5, 'rgb( 0, 0, 0)'],
[1.0, 'rgb( 99, 110, 250)']
]
detail_steps = 100
(x_vis_0_min, x_vis_1_min) = (-1.05, -1.05) #X_train.min(axis=0)
(x_vis_0_max, x_vis_1_max) = ( 1.05, 1.05) #X_train.max(axis=0)
x_vis_0_range = np.linspace(x_vis_0_min, x_vis_0_max, detail_steps)
x_vis_1_range = np.linspace(x_vis_1_min, x_vis_1_max, detail_steps)
(XX_vis_0, XX_vis_1) = np.meshgrid(x_vis_0_range, x_vis_1_range)
X_vis = np.c_[XX_vis_0.reshape(-1), XX_vis_1.reshape(-1)]
YY_vis = svm.decision_function(X_vis).reshape(XX_vis_0.shape)
points = go.Scatter(
x=df['Feature 1'],
y=df['Feature 2'],
mode='markers',
marker=dict(
color=y,
colorscale=points_colorscale),
showlegend=False
)
SVs = svm.support_vectors_
support_vectors = go.Scatter(
x=SVs[:, 0],
y=SVs[:, 1],
mode='markers',
marker=dict(
size=15,
color='black',
opacity = 0.1,
colorscale=points_colorscale),
line=dict(dash='solid'),
showlegend=False
)
decision_surface = go.Contour(x=x_vis_0_range,
y=x_vis_1_range,
z=YY_vis,
contours_coloring='lines',
line_width=2,
contours=dict(
start=0,
end=0,
size=1),
colorscale=decision_colorscale,
showscale=False
)
margins = go.Contour(x=x_vis_0_range,
y=x_vis_1_range,
z=YY_vis,
contours_coloring='lines',
line_width=2,
contours=dict(
start=-1,
end=1,
size=2),
line=dict(dash='dash'),
colorscale=decision_colorscale,
showscale=False
)
fig2 = go.Figure(data=[margins, decision_surface, support_vectors, points], layout=layout)
return fig2.show()
Let's visualize the decision surface of the svm along with its support vectors:
svm_show(svm)
The datapoints, the decision surface (which is a line here), the margins and the support vectors are shown in the plot.
As we can see in the plot, the decision surface is underfitting the data. Let's use a polynomial kernel: I define svm_p2 as an instance of SVC, this time with arguments kernel='poly' and degree=2 to define a degree-2 polynomial kernel:
svm_p2 = SVC(kernel='poly', degree=2)
# fit it to your training data:
svm_p2.fit(X_train, y_train)
# visualize this model with the function svm_show
svm_show(svm_p2)
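Under the hood, the polynomial kernel scikit-learn uses is $k(x, z) = (\gamma\, x \cdot z + c_0)^d$. A minimal sketch checking a hand-rolled version against sklearn.metrics.pairwise.polynomial_kernel (toy points and assumed gamma/coef0 values, not what SVC's defaults compute from the data):

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

def poly_k(x, z, degree=2, gamma=1.0, coef0=0.0):
    # (gamma * <x, z> + coef0) ** degree
    return (gamma * np.dot(x, z) + coef0) ** degree

x = np.array([[1.0, 2.0]])
z = np.array([[3.0, 0.5]])
ours = poly_k(x[0], z[0])
theirs = polynomial_kernel(x, z, degree=2, gamma=1.0, coef0=0.0)[0, 0]
assert np.isclose(ours, theirs)  # both equal (1*3 + 2*0.5)**2 = 16.0
```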
Looks better, but let's also try a degree-3 model, svm_p3 with degree=3 this time:
svm_p3 = SVC(kernel='poly', degree=3)
svm_p3.fit(X_train, y_train)
svm_show(svm_p3)
Let's try an RBF (Radial Basis Function) kernel as well. RBF is the default kernel for scikit-learn's SVC, so for the model svm_r I can simply omit the kernel argument (the degree argument is also irrelevant here, since we are no longer using a polynomial kernel, so I skip that too):
svm_r = SVC()
svm_r.fit(X_train,y_train)
svm_show(svm_r)
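The RBF kernel is $k(x, z) = \exp(-\gamma \lVert x - z\rVert^2)$. A quick hand-rolled check against sklearn.metrics.pairwise.rbf_kernel (toy points and an assumed gamma, not the gamma='scale' value SVC derives from the data):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def my_rbf(x, z, gamma=0.5):
    # exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([[0.0, 1.0]])
z = np.array([[1.0, 1.0]])
# ||x - z||^2 = 1, so both sides equal exp(-0.5)
assert np.isclose(my_rbf(x[0], z[0]), rbf_kernel(x, z, gamma=0.5)[0, 0])
```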
How do we pick the best model? I'll use the validation data: for each model, I predict labels for X_train and for X_validation and compare the two accuracies (a validation accuracy close to the training accuracy suggests the model generalizes well):
models = [svm, svm_p2, svm_p3, svm_r]
names = ['linear', 'poly degree 2', 'poly degree 3', 'rbf']
for model, name in zip(models, names):
    yhat_train = model.predict(X_train)
    yhat_validation = model.predict(X_validation)
    print(f'{name}: train accuracy = {accuracy_score(y_train, yhat_train)}, '
          f'validation accuracy = {accuracy_score(y_validation, yhat_validation)}')
From these numbers we can see that the RBF model works best: its validation accuracy is high, and the gap between training and validation accuracy is small. We could further tune the model's generalization by adjusting the C argument of SVC, which is inversely proportional to the regularization strength.
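As a sketch of that tuning step (synthetic data and an assumed grid of C values, not the notebook's actual split):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# toy data standing in for the notebook's training/validation sets
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = np.where(X_demo[:, 0] ** 2 + X_demo[:, 1] ** 2 > 1.0, 1, -1)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

best_C, best_acc = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0, 100.0]:  # smaller C = stronger regularization
    acc = accuracy_score(y_val, SVC(C=C).fit(X_tr, y_tr).predict(X_val))
    if acc > best_acc:
        best_C, best_acc = C, acc
print(best_C, best_acc)
```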
Finally, let's check accuracy on the test data to get a final performance number, predicting yhat_test_r from X_test with svm_r:
yhat_test_r = svm_r.predict(X_test)
accuracy_score(yhat_test_r, y_test)
We get good performance on the test data.